Chinese Word Segmentation in ICT-NLP

Author

  • Shuanglong Li
Abstract

Chinese word segmentation has always been regarded as important at ICT-NLP. Two different ICT-NLP systems participated in this bakeoff: SYSTEM_#1, evaluated in three tracks (PK-close, MSR-close and MSR-open), and SYSTEM_#2, evaluated in PK-open. Through this bakeoff, we learned about the current state of Chinese segmentation and identified problems in our systems.

1 System Description

Two different ICT-NLP systems participated in the second bakeoff.

1.1 SYSTEM_#1

SYSTEM_#1 is based mainly on the log-linear model CRFs (Conditional Random Fields). CRFs are arbitrary undirected graphical models trained to maximize the conditional probability of the desired outputs given the corresponding inputs. We cast segmentation as a sequence tagging problem. The conditional probability of the tag sequence $T = t_1 t_2 \ldots t_n$ given an input Chinese sentence $C = c_1 c_2 \ldots c_n$ is defined by a linear-chain CRF with parameters $\lambda_1 \lambda_2 \ldots \lambda_m$ to be

$$P(T \mid C) = \frac{1}{Z_c} \exp\left( \sum_{i=1}^{n} \sum_{m} \lambda_m f_m(t_{i-1}, t_i, C, i) \right)$$

where $Z_c$ is the per-input normalization that makes the probabilities of all state sequences sum to one, and $f_m(t_{i-1}, t_i, C, i)$ is a feature function, which can take any real value. The most probable tag sequence for an input $C$,

$$T^* = \arg\max_{T} P(T \mid C)$$

is determined using the Viterbi algorithm. An N-best list of tagging sequences is obtained using a modified Viterbi algorithm. Six tags, assigned according to the position of a character in a word, are used in this model, such as #START (beginning of a sentence), B (beginning of a word), M (middle of a word), E (end of a word), and #END (end of a sentence). The feature templates used in this model are listed in Table 1.
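The decoding step described above can be illustrated with a minimal Viterbi sketch. It is not the SYSTEM_#1 implementation: the function names `viterbi` and `toy_score`, the hand-set weights in `ALLOWED` and `EMIT`, and the reduced tag set (B, M, E only, with #START used as the initial previous tag and #END left out for brevity) are all assumptions made for illustration. The `score` callable stands in for the inner sum $\sum_m \lambda_m f_m(t_{i-1}, t_i, C, i)$.

```python
from typing import Callable, Dict, List, Sequence

NEG = -1e9  # large penalty that effectively forbids a transition in this toy

# Hypothetical hand-set weights; a trained CRF would learn these from data.
ALLOWED = {("#START", "B"), ("B", "M"), ("B", "E"),
           ("M", "M"), ("M", "E"), ("E", "B")}
EMIT = {("我", "B"): 1.0, ("们", "E"): 1.0, ("喜", "B"): 1.0, ("欢", "E"): 1.0}
TAGS = ["B", "M", "E"]  # reduced tag set, for illustration only


def toy_score(prev_tag: str, tag: str, chars: Sequence[str], i: int) -> float:
    """Stand-in for sum_m lambda_m * f_m(t_{i-1}, t_i, C, i)."""
    trans = 0.0 if (prev_tag, tag) in ALLOWED else NEG
    return trans + EMIT.get((chars[i], tag), 0.0)


def viterbi(chars: Sequence[str], tags: Sequence[str],
            score: Callable[[str, str, Sequence[str], int], float]) -> List[str]:
    """Return the highest-scoring tag sequence for chars."""
    n = len(chars)
    if n == 0:
        return []
    # best[i][t]: best score of any tag path ending in tag t at position i
    best: List[Dict[str, float]] = [{t: score("#START", t, chars, 0) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, n):
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(p, t, chars, i))
            row[t] = best[i - 1][prev] + score(prev, t, chars, i)
            ptr[t] = prev
        best.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[n - 1][t])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


print(viterbi(list("我们喜欢"), TAGS, toy_score))  # ['B', 'E', 'B', 'E']
```

Because the normalization $Z_c$ is the same for every candidate tag sequence of a given input, the maximization only needs the unnormalized scores, which is why the sketch never computes it.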
Table 1

Description                                         Feature
current state                                       $t_i, c_i$
current & previous states                           $t_{i-1}, t_i, c_i$
current & two previous states                       $t_{i-2}, t_{i-1}, t_i, c_i$
state transitions                                   $t_{i-1}, t_i$
second previous character                           $t_i, c_{i-2}$
previous character                                  $t_i, c_{i-1}$
next character                                      $t_i, c_{i+1}$
second next character                               $t_i, c_{i+2}$
previous two characters                             $t_i, c_{i-2}, c_{i-1}$
next two characters                                 $t_i, c_{i+1}, c_{i+2}$
previous, current & next character                  $t_i, c_{i-1}, c_i, c_{i+1}$
current and previous character                      $t_i, c_{i-1}, c_i$
current and next character                          $t_i, c_i, c_{i+1}$
character types of current & two previous
& two next characters                               ...
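As a rough illustration of how the character-context templates in Table 1 might be instantiated at a position $i$, the sketch below builds one observation string per template. The function names, the feature keys, the `|` separator, and the `#PAD` symbol for positions outside the sentence are all hypothetical choices, not taken from the paper. The character-type template is left out because its definition is not available here, and the tag components ($t_i$, $t_{i-1}$, $t_{i-2}$) are assumed to be paired with these observations by the CRF toolkit.

```python
from typing import Dict, Sequence

PAD = "#PAD"  # hypothetical padding symbol for out-of-sentence positions


def char_at(chars: Sequence[str], i: int) -> str:
    """Return the character at position i, or PAD if i is out of range."""
    return chars[i] if 0 <= i < len(chars) else PAD


def context_features(chars: Sequence[str], i: int) -> Dict[str, str]:
    """Instantiate the character-context templates at position i."""
    def c(k: int) -> str:
        return char_at(chars, i + k)

    return {
        "c0": c(0),                                     # current character
        "c-2": c(-2),                                   # second previous character
        "c-1": c(-1),                                   # previous character
        "c+1": c(1),                                    # next character
        "c+2": c(2),                                    # second next character
        "c-2,c-1": c(-2) + "|" + c(-1),                 # previous two characters
        "c+1,c+2": c(1) + "|" + c(2),                   # next two characters
        "c-1,c0,c+1": c(-1) + "|" + c(0) + "|" + c(1),  # previous, current & next
        "c-1,c0": c(-1) + "|" + c(0),                   # current and previous
        "c0,c+1": c(0) + "|" + c(1),                    # current and next
    }


feats = context_features(list("abcd"), 1)
print(feats["c-1,c0,c+1"])  # a|b|c
```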

Similar resources

Word Boundary Information and Chinese Word Segmentation

Chinese word segmentation could be considered as a problem of word boundary recognition. Word boundary information plays a significant role in human language acquisition and automatic segmentation for Natural Language Processing (NLP). Extraction of word boundary information involves cognitive psychology, computational linguistics, and language education. Methods utilizing word boundary informa...

Chinese Named Entity Recognition Using Role Model

This paper presents a stochastic model to tackle the problem of Chinese named entity recognition. In this research, we unify component tokens of named entity and their contexts into a generalized role set, which is like part-of-speech (POS). The probabilities of role emission and transition are acquired after machine learning on a role-labeled data set, which is transformed from a hand-correcte...

A New Psychometric-inspired Evaluation Metric for Chinese Word Segmentation

Word segmentation is a fundamental task for Chinese language processing. However, with the successive improvements, the standard metric is becoming hard to distinguish state-of-the-art word segmentation systems. In this paper, we propose a new psychometric-inspired evaluation metric for Chinese word segmentation, which addresses to balance the very skewed word distribution at different levels o...

FudanNLP: A Toolkit for Chinese Natural Language Processing

The growing need for Chinese natural language processing (NLP) is largely in a range of research and commercial applications. However, most of the currently Chinese NLP tools or components still have a wide range of issues need to be further improved and developed. FudanNLP is an open source toolkit for Chinese natural language processing (NLP), which uses statistics-based and rule-based method...

Normalized Accessor Variety Combined with Conditional Random Fields in Chinese Word Segmentation

The word is the basic unit in natural language processing (NLP), as it is at the lexical level upon which further processing rests. The lack of word delimiters such as spaces in Chinese texts makes Chinese word segmentation (CWS) an interesting while challenging issue. This paper describes the in-depth research following our participation in the fourth International Chinese Language Processing ...

Publication date: 2005